Introduction

On the three Gradients cruises we have observed rapid changes in the abundance (cells / µL) of Prochlorococcus, Synechococcus, and picoeukaryotes along the cruise tracks. We would like a clearly-defined procedure for defining where these changes occur. Many methods can be used to define these break points, for example, by looking at changes in the first and second derivatives, or by statistical tests to fit a piecewise constant or linear function to the abundance data. The derivative method runs into some difficulties due to small-scale variability in the abundance data; it’s not clear how much to smooth the data before computing derivatives and its possible to either detect changes that we would like to ignore or smooth out changes that appear visually important. Other problems arise with the statistical fitting of piecewise functions.

In this report I demonstrate an approach which ignores the temporal and spatial dimension of the data. We decompose the distribution of log abundances into a set of Gaussian distributions. The idea is that prevaling conditions in different parts of the cruise track will lead to high or low abundance, and various levels in between, of each taxonomic group. Abundance of any one species is generally log-normal, but if conditions change markedly, we observe large increases or decreases in the abudance. The goal of this method is to isolate those different states by decomposing abundance into a set of log normal distributions. The method requires the selection of the number of peaks, but there are statistical tools that can be used to select this number automatically. In general I’ve found that a user-imposed smaller than “optimal” number of peaks is easier to interpret and describes the breaks observed visually quite well.

Goals

  • Demonstrate the idea of decomposing the distribution of log abundance into a sequence of Normal (Gaussian) peaks.
  • Perform this decomposition on the three taxa from the three Gradients cruises
  • Quantify the location of the breakpoints along the ship transect (as latitude and time, with separate results for northbound and southbound trips)

Data

We obtain the data from CMAP and store a local copy. If you set downloadData to false, you will need a local copy of the file; if true, then you will need an API key for CMAP.

Pick out just the Gradients cruises

  • 1 - KOK1606
  • 2 - MGL1704
  • 3 - KM1906

Plot the abundance data to see what we’ve got.

## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
## Warning: Removed 841206 rows containing missing values (geom_point).

## Warning: Removed 841206 rows containing missing values (geom_point).

Developing the analysis

Here I work throught the analysis on a single species for a single cruise.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## ------------------------------------------------------- 
## Density estimation via Gaussian finite mixture modeling 
## ------------------------------------------------------- 
## 
## Mclust E (univariate, equal variance) model with 2 components: 
## 
##  log-likelihood    n df       BIC       ICL
##       -27616.82 6202  4 -55268.57 -55289.72
## 
## Mixing probabilities:
##         1         2 
## 0.1087832 0.8912168 
## 
## Means:
##          1          2 
##   7.852853 118.838953 
## 
## Variances:
##        1        2 
## 218.4532 218.4532

Plot the abundance along the cruise track coloring the data by cluster number. Strange things can happen if you use the “wrong” number of clusters. For example, if you repeat this analysis using 3 clusters (G=1:3) you will get a “better” model (larger BIC), but cluster 2 and 3 will overlap, one narow and one wide.

How to find the location of the breakpoints? Perform a running-median smooth on cluster number, picking the majority winner, and find where those change using first-order differences. Plot full cruise and a zoom in on the area of interest.

## Warning: Removed 5308 rows containing missing values (geom_point).

## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

## Warning: Removed 5338 rows containing missing values (geom_point).

## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

## Warning: Removed 5338 rows containing missing values (geom_point).

The location of the breaks.

All the cruises

Write functions to perform clustering, find breakpoints, and make summary table.

find_clusters = function(select_cruise = "KOK1606", select_spp = "prochloro", G = 1:2, seaflow = seaflow_subset) {
  seaflow_subset %>%
    filter(cruise == select_cruise, spp == select_spp) %>%
    mutate(cruise_time = time - min(time)) %>%
    na.omit() -> temp1
  temp1 %>%  pull(abundance) %>% 
    densityMclust(., G=G, model=c("E", "V")) -> model1
  temp1 %>%
    mutate(cluster = predict(model1, what="z") %>% apply(1, function(x) which(x==max(x)))) %>%
    arrange(time) %>%  # important for the data to be in chronological order before smoothing with a running median
    mutate(cluster_median = factor(runmed(cluster, 151))) -> temp2  # 3 removes single isolated changes
  temp2
}
find_breaks = function(clusters) {
  clusters %>% filter(cluster_median != lag(cluster_median)) %>%
    group_by( floor(cruise_time / (60*60))) %>%  # cluster any breaks within the same hour (approximately, definitely half hour)
    # group_by(floor(lat*10)) %>%
    summarize(time = mean(time), lat = mean(lat)) %>%
    select(time, lat)
}
make_plots = function(cruise = "KOK1606", species = "prochloro", G = 1:2, seaflow = seaflow_subset) {
  temp1 <- find_clusters(cruise, species, G, seaflow)
  temp2 <- find_breaks(temp1)
  p1 <- ggplot(temp1, aes(x=lat, y=abundance, color = factor(cluster_median))) +
           geom_point() +
           geom_vline(data=temp2, aes(xintercept = lat))
  p2 <- ggplot(temp1, aes(x=time, y=abundance, color = factor(cluster_median))) +
           geom_point() +
           geom_vline(data=temp2, aes(xintercept = time))
  p3 <- ggplot(temp1, aes(x=time, y=lat, color = factor(cluster_median))) +
           geom_point() +
           geom_vline(data=temp2, aes(xintercept = time))
  # p1 + p2 + p3
  print(p1)
  print(p2)
  print(p3)
  temp2
}
# find_clusters(cruise="MGL1704", species = "prochloro") %>% find_breaks()

Gradients 1

Prochlorococcus

Synechococcus

Picoeukaryotes

Gradients 2

Prochlorococcus

Synechococcus

Picoeukaryotes

Gradients 3

Prochlorococcus

Synechococcus

Picoeukaryotes